Abstract. We study the problem of allocating multiple users to a set
of wireless channels in a decentralized manner when the channel qualities
are time-varying and unknown to the users, and accessing the same
channel by multiple users leads to reduced quality due to interference.
In such a setting the users need to learn not only the inherent channel
quality but also the best allocation of users to channels
so as to maximize the social welfare. Assuming that the users adopt a
certain online learning algorithm, we investigate under what conditions
the socially optimal allocation is achievable. In particular we examine
the effect of different levels of knowledge the users may have and the
amount of communication and cooperation. The general conclusion is
that when the cooperation of users decreases and the uncertainty about
channel payoffs increases it becomes harder to achieve the socially
optimal allocation.
1 Introduction

In this paper we study the dynamic spectrum access and spectrum sharing problem
in a learning context. Specifically, we consider a set of N common channels
shared by a set of M users. A channel has time varying rate r(t), and its
statistics are not completely known by the users. Thus each user needs to employ
some type of learning to figure out which channels are of better quality, e.g., in 
terms of their average achievable rates. At the same time, simultaneous use of 
the same channel by multiple users will result in reduced rate due to interference 
or collision. The precise form of this performance degradation may or may not 
be known to the user. Thus the users also need to use learning to avoid excess 
interference or congestion. Furthermore, each user may have private information 
that is not shared, e.g., users may perceive channel quality differently due to 
difference in location as well as individual modulation/coding schemes. 

Without a central agent, and in the presence of the information decentralization
described above, we are interested in the following questions: (1) for a given
common learning algorithm, does the multiuser learning process converge, and
(2) if it does, what is the quality of the equilibrium point with respect to a
globally optimal spectrum allocation scheme, one that could be computed for a
global objective function with full knowledge of channel statistics as well as the
users' private information?

A few recent studies have addressed these questions in some special cases.
For instance, in [3] it was shown that learning using a sample-mean based index
policy leads to a socially optimal (sum of individual utilities) allocation when
channels evolve as iid processes and colliding players get zero reward, provided
that this optimal allocation is such that each user occupies one of the M best
channels (in terms of average rates). This precludes the possibility that not all
users have the same set of M best channels, and that in some cases the best
option is for multiple users to share a common channel, e.g., when N < M.

In this study we investigate under what conditions the socially optimal
allocation is achievable by considering different levels of communication (or
cooperation) allowed among users, and different levels of uncertainty on the channel
statistics. The general conclusion, as intuition would suggest, is that when the
cooperation of users increases and the channel uncertainty decreases it becomes
easier to achieve the socially optimal welfare. Specifically, we assume that the
rate (or reward) user i gets from channel j at time t is of the form $r_j(t) g_j(n_j(t))$,
where $r_j(t)$ is the rate of channel j at time t, $n_j(t)$ is the number of users using
channel j at time t, and $g_j$ is the user independent interference function (IF) for
channel j. This model is richer than the previously used models [3,14,16] since
$r_j(t)$ can represent environmental effects such as fading or primary user activity,
while $g_j$ captures interactions between users. We consider the following three
cases.

In the first case (C1), each channel evolves as an iid random process in time,
the users do not know the channel statistics, nor the form of the interference, nor
the total number of users present in the system, and no direct communication
is allowed among users. A user can measure the overall rate it gets from using
a channel but cannot tell how much of it is due to the dynamically changing
channel quality (i.e., what it would get if it were the only user) vs. interference
from other users. In this case, we show that if all users follow the Exp3 algorithm
[7] then the channel allocation converges to the set of pure Nash equilibria (PNE)
of a congestion game defined by the IFs and mean channel rates. A socially
optimal allocation cannot be ensured in this case, as the PNE differ in
quality, and in some cases the socially optimal allocation may not be a PNE.

In the second case (C2), each channel again evolves as an iid random process
in time, whose statistics are unknown to the user. However, the users now know
the total number of users in the system, as well as the fact that the quantitative
impact of interference is common to all users (i.e., user independent), though
the actual form of the interference function is unknown. In other words the rate
of channel j at time t is perceived by user i as $h_j(t, n_j(t))$, so user i cannot
distinguish between the components $r_j(t)$ and $g_j(n_j(t))$. Furthermore, users are now
allowed a minimal amount of communication when they happen to be in the same
channel, specifically to find out the total number of simultaneous users of that
channel. In this case we present a sample-mean based randomized learning policy
that achieves the socially optimal allocation as time goes to infinity, with a sub-linear
regret over the time horizon with respect to the socially optimal allocation.

In the third case (C3), as in case (C2) the users know the total number 
of users in the system, as well as the fact that the IF is user independent and 
decreasing without knowing the actual form of the IF. However, the channels are 
assumed to have constant, albeit unknown, rates. We show that even without 
any communication among users, there is a randomized learning algorithm that 
achieves the socially optimal allocation in finite time. 

It is worth pointing out that in the settings outlined above, the users are
non-strategic, i.e., each user simply follows a pre-set learning rule rather than
playing a game. In this context it is reasonable to introduce a minimal amount of
communication among users and assume they may cooperate. It is possible that
even in this case the users may not know their IF but only the total rate they
get for lack of better detecting capabilities (e.g., they may only be able to detect
the total received SNR as a result of channel rate and user interference).

Online learning by a single user was studied in [1,4,6,15], in which
sample-mean based index policies were shown to achieve logarithmic regret with respect
to the best single-action policy without a priori knowledge of the statistics, and
to be order-optimal, when the rewards are given by an iid process. In [5,21,22]
Markovian rewards are considered, with [22] focusing on restless reward
processes, where a process continues to evolve according to a Markov chain
regardless of the users' actions. In all these studies learning algorithms were developed
to achieve logarithmic regret. Multi-user learning with iid reward processes has
been studied in a dynamic spectrum context by [3,11,16], with a combinatorial
structure adopted in [11], and with collision and random access models in [3,16].
In [13], convergence of multi-user learning with the Exp3 algorithm to pure Nash
equilibrium is investigated under the collision and fair sharing models. In the
collision model, when there is more than one user on a channel all get zero reward,
whereas in the random access model one of them, selected randomly, gets all the
reward while the others get zero reward. In the fair sharing model, a user's utility is
inversely proportional to the number of users who are on the same channel with
the user. Note that these models do not capture more sophisticated
communication schemes where the rate a user gets is a function of the received SNR of
the form $g_j(n) = f_j\left(\frac{P_t}{N_0 + (n-1)P_t}\right)$, where $P_t$ is the nominal transmit power of
all users and $N_0$ the noise. Moreover, in the above studies the socially optimal
allocation is a rather simple one: it is the orthogonal allocation of users to the
first M channels with the highest mean rewards. By contrast, we model a more
general interference relationship among users, in which an allocation with users
sharing the same channel may be the socially optimal one. The socially optimal
allocation is not trivial in this case and additional mechanisms may be needed
for the learning algorithms to converge.
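
As an illustration of an SNR-dependent interference function, the sketch below evaluates $g(n)$ for a user whose signal-to-interference-plus-noise ratio is $P_t / (N_0 + (n-1)P_t)$. The choice of $f$ as a Shannon-type rate $\log_2(1 + \text{SINR})$, the normalization $g(1) = 1$, and the numerical values of `p_t` and `n0` are assumptions made only for this example; the paper leaves $f_j$ unspecified.

```python
import math


def sinr_interference(n, p_t=1.0, n0=0.1):
    """Normalized rate g(n) when n users share one channel.

    Assumption (not from the paper): f is log2(1 + SINR), scaled so
    that a lone user gets g(1) = 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    sinr = p_t / (n0 + (n - 1) * p_t)   # SINR seen with n - 1 interferers
    solo = p_t / n0                     # SINR of a lone user
    return math.log2(1.0 + sinr) / math.log2(1.0 + solo)


# g(1), ..., g(4): decreasing in n, with values in (0, 1].
g_values = [sinr_interference(n) for n in range(1, 5)]
```

With such a $g$, sharing a strong channel can yield a higher total rate than moving a user to a much weaker channel, which is the situation the simpler collision and fair sharing models cannot represent.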

All of the above mentioned work assumes some level of communication be- 
tween the users either at the beginning or during the learning. If we assume 



no communication between the users, achieving the socially optimal allocation 
seems very challenging in general. Then one may ask if it is possible to achieve 
some kind of equilibrium allocation. Kleinberg et. al. p3] showed that it is pos- 
sible for the case when the channel rates are constant and the users do not 
know the IFs. They show that when the users use aggregate monotonic selection
dynamics, a variant of the Hedge algorithm [10], the allocation converges to weakly
stable equilibria, which are a subset of the Nash equilibria (NE) of the congestion game
defined by the IFs. They show that for almost all congestion games the weakly stable
equilibria are the same as the PNE.

Other than the work described above, [2] considers spatial congestion games, a
generalization of congestion games, and gives conditions under which there exists
a PNE and best-response play converges to PNE. A mechanism design approach
for socially optimal power allocation when users are strategic is considered in 

The organization of the remainder of this paper is as follows. In Sect. 2 we
present the notations and definitions that will be used throughout the paper.
In Sects. 3-5 we analyze the cases stated in (C1), (C2), (C3) and derive the
results respectively. Conclusion and future research are given in Sect. 6.

2 Preliminaries 

Denote the set of users by $\mathcal{M} = \{1, 2, \ldots, M\}$, and the set of channels by
$\mathcal{N} = \{1, 2, \ldots, N\}$. Time is slotted and indexed by $t = 1, 2, \ldots$ and a user can select
a single channel at each time step t. Without loss of generality let $r_j(t) \in [0, 1]$
be the rate of channel j at time t such that $\{r_j(t)\}_{t=1,2,\ldots}$ is generated by an
iid process with support [0, 1] and mean $\mu_j \in [0, 1]$. Let $g_j : \mathbb{N} \to [0, 1]$ be the
interference function (IF) on channel j where $g_j(n)$ represents the interference
when there are n users on channel j. We express the rate of channel j seen by
a user as $h_j(t) = r_j(t) g_j(n_j(t))$ when a user does not know the total number of
users $n_j(t)$ using channel j at time t, as in cases (C1) and (C3). When a user knows
$n_j(t)$, we express the rate of channel j at time t as $h_{j,n_j(t)}(t) = r_j(t) g_j(n_j(t))$, as
in case (C2). Let $S_i = \mathcal{N}$ be the set of feasible actions of user i and $\sigma_i \in S_i$ be the
action, i.e., channel selected by user i. Let $S = S_1 \times S_2 \times \ldots \times S_M = \mathcal{N}^M$ be the
set of feasible action profiles and $\sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_M\} \in S$ be the action profile
of the users. Throughout the discussion we assume that the action of player i at
time t, i.e., $\sigma_i^{\pi_i}(t)$, is determined by the policy $\pi_i$. When $\pi_i$ is deterministic, $\pi_i(t)$
is in general a function from all past observations and decisions of user i to the set
of actions $S_i$. When $\pi_i$ is randomized, $\pi_i(t)$ generates a probability distribution
over the set of actions $S_i$ according to all past observations and decisions of user
i, from which the action at time t is sampled. Since the dependence of actions on
the policy is trivial we use $\sigma_i(t)$ to denote the action of user i at time t, dropping
the superscript $\pi_i$.

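Let $K_j(\sigma)$ be the set of users on channel j when the action profile is $\sigma$; with
a slight abuse of notation, $K_j(\sigma)$ also denotes the number of such users when it
appears as the argument of $g_j$ or as a multiplier. Let

\[ A^* = \arg\max_{\sigma \in S} \sum_{i=1}^{M} \mu_{\sigma_i} g_{\sigma_i}(K_{\sigma_i}(\sigma)) = \arg\max_{\sigma \in S} \sum_{j=1}^{N} \mu_j K_j(\sigma) g_j(K_j(\sigma)) \]

be the set of socially optimal allocations and denote by $\sigma^*$ any action profile
that is in the set $A^*$. Let $v^*$ denote the socially optimal welfare, i.e.,
$v^* = \sum_{i=1}^{M} \mu_{\sigma_i^*} g_{\sigma_i^*}(K_{\sigma_i^*}(\sigma^*))$, and $v_j^*$ denote the payoff a user gets from channel j
under the socially optimal allocation, i.e., $v_j^* = \mu_j g_j(K_j(\sigma^*))$ if $K_j(\sigma^*) \neq \emptyset$.
Note that any permutation of actions in $\sigma^*$ is also a socially optimal allocation
since IFs are user-independent.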
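
Because the IFs are user independent, the welfare of an action profile depends only on how many users sit on each channel, so the social optimum can be found by enumerating occupancy vectors. The following is a minimal brute-force sketch of that computation; the function names `compositions` and `social_optimum` and the example values of `mu` and `g` are assumptions introduced for illustration, and the enumeration is exponential in N, so it is only meant for small instances.

```python
def compositions(m, n):
    """Yield all tuples (k_1, ..., k_n) of non-negative ints summing to m."""
    if n == 1:
        yield (m,)
        return
    for first in range(m + 1):
        for rest in compositions(m - first, n - 1):
            yield (first,) + rest


def social_optimum(mu, g, m):
    """Return (k*, v*) maximizing sum_j mu[j] * k_j * g[j](k_j).

    mu[j] is the mean rate of channel j and g[j](k) is its interference
    factor with k users; empty channels contribute nothing.
    """
    best_k, best_v = None, float("-inf")
    for k in compositions(m, len(mu)):
        v = sum(mu[j] * k[j] * g[j](k[j]) for j in range(len(mu)) if k[j] > 0)
        if v > best_v:
            best_k, best_v = k, v
    return best_k, best_v


# Hypothetical example: one strong and one weak channel, M = 2 users,
# and an IF under which a shared channel keeps 80% of its rate per user.
mu = [0.9, 0.1]
g = [lambda k: 0.8 ** (k - 1)] * 2
k_star, v_star = social_optimum(mu, g, 2)
```

In this example both users sharing the strong channel is socially optimal, which illustrates why an orthogonal allocation is not always the best one under a general IF.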

For any policy $\pi$, the regret at time n is

\[ R(n) = n v^* - E\left[ \sum_{t=1}^{n} \sum_{i=1}^{M} r_{\sigma_i(t)}(t)\, g_{\sigma_i(t)}\big(K_{\sigma_i(t)}(\sigma(t))\big) \right], \]

where expectation is taken with respect to the random nature of the rates and
the randomization of the policy. Note that for a deterministic policy expectation
is only taken with respect to the random nature of the rates. For any randomized
policy $\pi_i$, let $p_i(t) = (p_{i1}(t), p_{i2}(t), \ldots, p_{iN}(t))$ be the mixed strategy of user i at
time t, i.e., a probability distribution on $\{1, 2, \ldots, N\}$. For a profile of policies
$\pi = [\pi_1, \pi_2, \ldots, \pi_M]$ for the users let $p(t) = (p_1(t)^T, p_2(t)^T, \ldots, p_M(t)^T)^T$ be
the profile of mixed strategies at time t, where $p_i(t)^T$ is the transpose of $p_i(t)$.
Then $\sigma_i(t)$ is the action sampled from the probability distribution $p_i(t)$. The
dependence of p on $\pi$ is trivial and not shown in the notation.

3 Allocations Achievable with Exp3 Algorithm (Case 1) 

We start by defining a congestion game. A congestion game [17,18] is given by
the tuple $(\mathcal{M}, \mathcal{N}, (\Sigma_i)_{i \in \mathcal{M}}, (h_j)_{j \in \mathcal{N}})$, where $\mathcal{M}$ denotes a set of players (users),
$\mathcal{N}$ a set of resources (channels), $\Sigma_i \subset 2^{\mathcal{N}}$ the strategy space of player i, and
$h_j : \mathbb{N} \to \mathbb{R}$ a payoff function associated with resource j, which is a function of
the number of players using that resource. It is well known that a congestion
game has a potential function, that the local maxima of the potential function
correspond to PNE, and that every sequence of asynchronous improvement steps is
finite and converges to a PNE.

In this section we relate the strategy update rule of Exp3 [7] under
assumptions (C1) to a congestion game. Exp3, as given in Fig. 1, is a randomized
algorithm consisting of an exploration parameter $\gamma$ and weights $w_{ij}$ that depend
exponentially on the past observations, where i denotes the user and j denotes
the channel. Each user runs Exp3 independently but we explicitly note the user
dependence because a user's action affects other users' updates.

At any time step before the channel rate and user actions are drawn from
the corresponding distributions, let $R_j$ denote the random variable
corresponding to the reward of the jth channel. Let $G_{ij} = g_j(1 + K'_j(i))$ be the random
variable representing the payoff user i gets from channel j, where $K'_j(i)$ is the
random variable representing the number of users on channel j other than user
i. Let $U_{ij} = R_j G_{ij}$ and let $\bar{U}_{ij} = E_j[E_{-i}[U_{ij}]]$ be the expected payoff to user i by
using channel j, where $E_{-i}$ represents the expectation taken with respect to the
randomization of players other than i, and $E_j$ represents the expectation taken with
respect to the randomization of the rate of channel j. Since the channel rate is
independent of users' actions, $\bar{U}_{ij} = \mu_j \bar{g}_{ij}$ where $\bar{g}_{ij} = E_{-i}[G_{ij}]$.



Exp3 (for user i)
Initialize: $\gamma \in (0, 1)$, $w_{ij}(t) = 1, \forall j \in \mathcal{N}$, $t = 1$
while t > 0 do
    $p_{ij}(t) = (1 - \gamma) \frac{w_{ij}(t)}{\sum_{l=1}^{N} w_{il}(t)} + \frac{\gamma}{N}$
    Sample $\sigma_i(t)$ from the distribution $p_i(t) = [p_{i1}(t), p_{i2}(t), \ldots, p_{iN}(t)]$
    Play channel $\sigma_i(t)$ and receive reward $h_{\sigma_i(t)}(t)$
    for j = 1, 2, ..., N do
        if $j = \sigma_i(t)$ then
            Set $w_{ij}(t+1) = w_{ij}(t) \exp\left( \frac{\gamma h_{\sigma_i(t)}(t)}{p_{ij}(t) N} \right)$
        else
            Set $w_{ij}(t+1) = w_{ij}(t)$
        end if
    end for
    t = t + 1
end while

Fig. 1. Pseudocode of Exp3
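
The update rule of Fig. 1 can be sketched in a few lines of Python. This is a minimal single-user illustration under stated assumptions: the class name `Exp3User`, its method names, and the constant rewards used in the usage loop are all hypothetical and chosen only to exercise the update; in the multi-user setting of this section the reward passed to `update` would be the observed rate $h_{\sigma_i(t)}(t)$.

```python
import math
import random


class Exp3User:
    """One user running Exp3 over n_channels channels (rewards in [0, 1])."""

    def __init__(self, n_channels, gamma, rng=None):
        self.n = n_channels
        self.gamma = gamma
        self.weights = [1.0] * n_channels
        self.rng = rng or random.Random()
        self.last_probs = None
        self.last_choice = None

    def probabilities(self):
        # p_ij = (1 - gamma) * w_ij / sum_l w_il + gamma / N
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n
                for w in self.weights]

    def choose(self):
        # Sample the channel to play from the current mixed strategy.
        self.last_probs = self.probabilities()
        self.last_choice = self.rng.choices(
            range(self.n), weights=self.last_probs)[0]
        return self.last_choice

    def update(self, reward):
        # Only the weight of the played channel changes.
        j = self.last_choice
        self.weights[j] *= math.exp(
            self.gamma * reward / (self.last_probs[j] * self.n))


# Hypothetical usage: two channels with constant rewards 0.9 and 0.1.
user = Exp3User(n_channels=2, gamma=0.1, rng=random.Random(0))
rewards = [0.9, 0.1]
for _ in range(500):
    channel = user.choose()
    user.update(rewards[channel])
final_probs = user.probabilities()
```

Dividing the reward by the selection probability is what lets Exp3 work with bandit feedback: a channel that is rarely played gets a proportionally larger update when it is played.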



Lemma 1. Under (C1) when all players use Exp3, the derivative of the
continuous-time limit of Exp3 is the replicator equation given by

\[ \dot{\xi}_{ij} = \frac{1}{N} (\mu_j p_{ij}) \sum_{l=1}^{N} p_{il} (\bar{g}_{ij} - \bar{g}_{il}). \]

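Proof. Note that

\[ (1 - \gamma) w_{ij}(t) = \sum_{l=1}^{N} w_{il}(t) \left( p_{ij}(t) - \frac{\gamma}{N} \right). \tag{1} \]

We consider the effect of user i's action $\sigma_i(t)$ on his probability update on channel
j. We have two cases: $\sigma_i(t) = j$ and $\sigma_i(t) \neq j$. Let $A^{\gamma,t}_{i,j} = \exp\left( \frac{\gamma U_{ij}(t)}{p_{ij}(t) N} \right)$.

Consider the case $\sigma_i(t) = j$.

\[ p_{ij}(t+1) = \frac{(1 - \gamma) w_{ij}(t) A^{\gamma,t}_{i,j}}{\sum_{l=1}^{N} w_{il}(t) + w_{ij}(t) \left( A^{\gamma,t}_{i,j} - 1 \right)} + \frac{\gamma}{N}. \tag{2} \]

Substituting (1) into (2),

\[ p_{ij}(t+1) = \frac{\left( p_{ij}(t) - \frac{\gamma}{N} \right) A^{\gamma,t}_{i,j}}{1 + \frac{p_{ij}(t) - \frac{\gamma}{N}}{1 - \gamma} \left( A^{\gamma,t}_{i,j} - 1 \right)} + \frac{\gamma}{N}. \]

The continuous time process is obtained by taking the limit $\gamma \to 0$, i.e., the
rate of change in $p_{ij}$ with respect to $\gamma$ as $\gamma \to 0$. Then, dropping the discrete
time script t,

\[ \dot{p}_{ij} = \lim_{\gamma \to 0} \frac{d p_{ij}}{d\gamma} = \lim_{\gamma \to 0} \frac{d}{d\gamma} \left[ \frac{\left( p_{ij} - \frac{\gamma}{N} \right) A^{\gamma}_{i,j}}{1 + \frac{p_{ij} - \frac{\gamma}{N}}{1 - \gamma} \left( A^{\gamma}_{i,j} - 1 \right)} + \frac{\gamma}{N} \right] = \frac{U_{ij} (1 - p_{ij})}{N}. \]

Consider the case $\sigma_i(t) = k \neq j$. Then,

\[ p_{ij}(t+1) = \frac{(1 - \gamma) w_{ij}(t)}{\sum_{l=1}^{N} w_{il}(t) + w_{ik}(t) \left( A^{\gamma,t}_{i,k} - 1 \right)} + \frac{\gamma}{N} = \frac{p_{ij}(t) - \frac{\gamma}{N}}{1 + \frac{p_{ik}(t) - \frac{\gamma}{N}}{1 - \gamma} \left( A^{\gamma,t}_{i,k} - 1 \right)} + \frac{\gamma}{N}. \]

Thus

\[ \dot{p}_{ij} = \lim_{\gamma \to 0} \frac{d}{d\gamma} \left[ \frac{p_{ij} - \frac{\gamma}{N}}{1 + \frac{p_{ik} - \frac{\gamma}{N}}{1 - \gamma} \left( A^{\gamma}_{i,k} - 1 \right)} + \frac{\gamma}{N} \right] = -\frac{p_{ij} U_{ik}}{N}. \tag{3} \]

Then from the limit obtained for the case $\sigma_i(t) = j$ and (3), the expected change in $p_{ij}$ with respect to the
probability distribution $p_i$ of user i over the channels is

\[ \bar{p}_{ij} = E_i[\dot{p}_{ij}] = \frac{1}{N} p_{ij} \sum_{l \in \mathcal{N} - \{j\}} p_{il} (U_{ij} - U_{il}). \]

Taking the expectation with respect to the randomization of channel rates and
other users' actions we have

\[ \dot{\xi}_{ij} = E_j[E_{-i}[\bar{p}_{ij}]] = \frac{1}{N} p_{ij} \sum_{l \in \mathcal{N} - \{j\}} p_{il} (\bar{U}_{ij} - \bar{U}_{il}) = \frac{1}{N} (\mu_j p_{ij}) \sum_{l=1}^{N} p_{il} (\bar{g}_{ij} - \bar{g}_{il}). \]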

Lemma 1 shows that the dynamics of a user's probability distribution over the
actions is given by a replicator equation, which is commonly studied in
evolutionary game theory [19,20]. With this lemma we can establish the following
theorem.
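
The two limits used in the proof of Lemma 1 can be checked numerically. The sketch below writes the one-step Exp3 probability update in closed form for the two cases (the played channel and a channel that was not played), as obtained by eliminating the weights from the update rule of Fig. 1, and compares a finite-difference slope in $\gamma$ with the values $U_{ij}(1 - p_{ij})/N$ and $-p_{ij} U_{ik}/N$. The function names and all numerical values are assumptions chosen for illustration.

```python
import math


def p_next_played(p_j, u_j, n, gamma):
    """p_ij(t+1) when user i played channel j and received payoff u_j."""
    a = math.exp(gamma * u_j / (p_j * n))
    q = p_j - gamma / n
    return q * a / (1.0 + q * (a - 1.0) / (1.0 - gamma)) + gamma / n


def p_next_not_played(p_j, p_k, u_k, n, gamma):
    """p_ij(t+1) when user i played another channel k with payoff u_k."""
    a = math.exp(gamma * u_k / (p_k * n))
    q_j = p_j - gamma / n
    q_k = p_k - gamma / n
    return q_j / (1.0 + q_k * (a - 1.0) / (1.0 - gamma)) + gamma / n


# Illustrative values (assumptions): N = 4 channels and a very small gamma.
N, GAMMA = 4, 1e-6
p_j, p_k, u_j, u_k = 0.3, 0.5, 0.7, 0.4

# At gamma = 0 both updates return p_j, so these are forward differences.
slope_played = (p_next_played(p_j, u_j, N, GAMMA) - p_j) / GAMMA
slope_not_played = (p_next_not_played(p_j, p_k, u_k, N, GAMMA) - p_j) / GAMMA
```

Playing a channel pushes its probability up in proportion to the payoff and to the remaining probability mass $1 - p_{ij}$, while playing any other channel pulls it down in proportion to $p_{ij}$; averaging the two effects over the user's own randomization gives the replicator form.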

Theorem 1. For all but a measure zero subset of $[0, 1]^{2N}$ from which the $\mu_j$'s
and $g_j$'s are selected, when $\gamma$ in Exp3 is arbitrarily small, the action profile
converges to the set of PNE of the congestion game $(\mathcal{M}, \mathcal{N}, (S_i)_{i \in \mathcal{M}}, (\mu_j g_j)_{j \in \mathcal{N}})$.

Proof. Because the replicator equation in Lemma 1 is identical to the replicator
equation in [14], the proof of convergence to PNE follows from [14]. Here, we briefly
explain the steps in the proof. Defining the expected potential function to be
the expected value of the potential function $\phi$, where expectation is taken with
respect to the users' randomization, one can show that the solutions of the
replicator equation converge to the set of fixed points. Then the stability analysis
using the Jacobian matrix yields that every stable fixed point corresponds to a
Nash equilibrium. Then one can prove that for any stable fixed point the
eigenvalues of the Jacobian must be zero. This implies that every stable fixed point
corresponds to a weakly stable Nash equilibrium strategy in the game theoretic
sense. Then using tools from algebraic geometry one can show that almost every
weakly stable Nash equilibrium is a pure Nash equilibrium of the congestion
game.

We also need to investigate the error introduced by treating the discrete time
update rule as a continuous time process. However, by taking $\gamma$ infinitesimal
we can approximate the discrete time process by the continuous time process.
For a discussion of the case when $\gamma$ is not infinitesimal one can define approximately stable
equilibria [14]. □

The main difference between Exp3 and Hedge [H] is that in Exp3 users 
do not need to observe the payoffs from the channels that they do not select, 
whereas Hedge assumes complete observation. In addition, we consider
dynamic channel rates, which are not considered in [14].



4 An Algorithm for Socially Optimal Allocation with 
Sub-linear Regret (Case 2) 

In this section we propose an algorithm whose regret with respect to the
socially optimal allocation is $O(n^{\frac{2M-1+2\gamma}{2M}})$ for $\gamma > 0$ arbitrarily small. Clearly
this regret is sublinear and approaches linear as the number of users M
increases. This means that the time average of the sum of the utilities of the
players converges to the socially optimal welfare. Let $\mathcal{K} = \{k = (k_1, k_2, \ldots, k_N) :
k_j \geq 0, \forall j \in \mathcal{N}, k_1 + k_2 + \cdots + k_N = M\}$ denote the set of allocations of M users
to N channels. Note that an allocation gives only the number of users on
each channel. It does not say anything about which user uses which
channel. We assume that the socially optimal allocation is unique up to
permutations, so $k^* = \arg\max_{k \in \mathcal{K}} \sum_{j=1}^{N} \mu_j k_j g_j(k_j)$ is unique. We also assume the
following stability condition of the socially optimal allocation. Let $v_j(k_j) =
\mu_j g_j(k_j)$. Then the stability condition says that $\arg\max_{k \in \mathcal{K}} \sum_{j=1}^{N} k_j \hat{v}_j(k_j) = k^*$
if $|\hat{v}_j(k) - v_j(k)| \leq \epsilon, \forall k \in \{1, 2, \ldots, M\}, \forall j \in \mathcal{N}$, for some $\epsilon > 0$, where
$\hat{v}_j : \mathbb{N} \to \mathbb{R}$ is an arbitrary function. Let $T^i_{j,k}(t)$ be the number of times user
i used channel j and observed k users on it up to time t. We refer to the
tuple (j, k) as an arm. Let $n^i_{j,k}(t)$ be the time of the t-th observation of user
i from arm (j, k). Let $u^i_{j,k}(t)$ be the sample mean of the rewards from arm
(j, k) seen by user i at the end of the t-th play of arm (j, k) by user i, i.e.,
$u^i_{j,k}(t) = (h_{j,k}(n^i_{j,k}(1)) + \cdots + h_{j,k}(n^i_{j,k}(t)))/t$. Then the socially optimal
allocation estimated by user i at time t is $k^{i*}(t) = \arg\max_{k \in \mathcal{K}} \sum_{j=1}^{N} k_j u^i_{j,k_j}(t)$. The
pseudocode of the Randomized Learning Algorithm (RLA) is given in Fig. 2. At
time t RLA explores with probability $1/t^{\frac{1}{2M} - \frac{\gamma}{M}}$ by randomly choosing one of
the channels and exploits with probability $1 - 1/t^{\frac{1}{2M} - \frac{\gamma}{M}}$ by choosing a channel
which is occupied by a user in the estimated socially optimal allocation.
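
The exploration schedule is chosen so that the probability that all M users explore in the same slot decays like $t^{-(1/2 - \gamma)}$, which is the quantity that appears later in the regret analysis. The following small sketch makes this relation explicit; the function names and the example values of `t`, `m` and `gamma` are assumptions for illustration only.

```python
def explore_prob(t, m, gamma):
    """Probability that a single user explores at time t under RLA."""
    return 1.0 / t ** (1.0 / (2 * m) - gamma / m)


def all_explore_prob(t, m, gamma):
    """Probability that all m users explore at time t (independent draws)."""
    return explore_prob(t, m, gamma) ** m


# Hypothetical values: M = 3 users, gamma = 0.05, time t = 100.
single = explore_prob(100, 3, 0.05)
joint = all_explore_prob(100, 3, 0.05)
```

A single user explores often when M is large, since the per-user exponent $1/(2M) - \gamma/M$ shrinks with M; this is why the regret bound approaches linear growth as the number of users increases.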

The following will be useful in the proof of the main theorem of this section. 



Lemma 2. Let $X_i$, $i = 1, 2, \ldots$ be a sequence of independent Bernoulli random
variables such that $X_i$ has mean $q_i$ with $0 \leq q_i \leq 1$. Let $\bar{X}_k = \frac{1}{k} \sum_{i=1}^{k} X_i$,
$\bar{q}_k = \frac{1}{k} \sum_{i=1}^{k} q_i$. Then for any constant $\epsilon \geq 0$ and any integer $n \geq 0$,

\[ P(\bar{X}_n - \bar{q}_n \leq -\epsilon) \leq e^{-2n\epsilon^2}. \tag{5} \]

Proof. The result follows from symmetry and [5]. □

Lemma 3. For $p > 0$, $p \neq 1$,

\[ \frac{(n+1)^{1-p} - 1}{1 - p} < \sum_{t=1}^{n} \frac{1}{t^p} < 1 + \frac{n^{1-p} - 1}{1 - p}. \tag{6} \]

Proof. See [8].
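
The two bounds of Lemma 3 are the usual integral bounds on a sum of powers, and they are easy to check numerically. The sketch below evaluates both sides for given n and p; the function name `lemma3_bounds` and the test values are assumptions for illustration. Note that for n = 1 the upper bound holds with equality, so the check uses n of at least 2.

```python
def lemma3_bounds(n, p):
    """Return (lower bound, sum_{t=1}^n t^-p, upper bound) for p > 0, p != 1."""
    lower = ((n + 1) ** (1 - p) - 1) / (1 - p)
    total = sum(1.0 / t ** p for t in range(1, n + 1))
    upper = 1 + (n ** (1 - p) - 1) / (1 - p)
    return lower, total, upper


# p = 0.45 corresponds to the exponent 1/2 - gamma with gamma = 0.05.
checks = [lemma3_bounds(n, p) for n in (10, 1000) for p in (0.45, 2.0)]
```

In the regret analysis the lemma is applied with $p = 1/2 - \gamma$, which gives a lower bound of order $t^{1/2 + \gamma}$ on the expected number of slots in which all users explore.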



RLA (for user i)
Initialize: $0 < \gamma \ll 1$, $u^i_{j,k}(1) = 0$, $T^i_{j,k}(1) = 0$, $\forall j \in \mathcal{N}, k \in \mathcal{M}$, $t = 1$,
    sample $\sigma_i(1)$ uniformly from $\mathcal{N}$.
while t > 0 do
    play channel $\sigma_i(t)$, observe $l(t)$, the total number of players using channel
        $\sigma_i(t)$, and reward $h_{\sigma_i(t),l(t)}(t)$.
    Set $T^i_{\sigma_i(t),l(t)}(t+1) = T^i_{\sigma_i(t),l(t)}(t) + 1$.
    Set $T^i_{j,l}(t+1) = T^i_{j,l}(t)$ for $(j, l) \neq (\sigma_i(t), l(t))$.
    Set $u^i_{\sigma_i(t),l(t)}(t+1) = \frac{T^i_{\sigma_i(t),l(t)}(t)\, u^i_{\sigma_i(t),l(t)}(t) + h_{\sigma_i(t),l(t)}(t)}{T^i_{\sigma_i(t),l(t)}(t+1)}$.
    Set $u^i_{j,l}(t+1) = u^i_{j,l}(t)$ for $(j, l) \neq (\sigma_i(t), l(t))$.
    Set $k^{i*}(t+1) = \arg\max_{k \in \mathcal{K}} \sum_{j=1}^{N} k_j u^i_{j,k_j}(t+1)$.
    Set $\theta^{i*}(t+1)$ to be the set of channels used by at least one user in $k^{i*}(t+1)$.
    Draw $i_t$ randomly from a Bernoulli distribution with $P(i_t = 1) = 1/t^{\frac{1}{2M} - \frac{\gamma}{M}}$.
    if $i_t = 0$ then
        if $\sigma_i(t) \in \theta^{i*}(t+1)$ and $l(t) = k^{i*}_{\sigma_i(t)}(t+1)$ then
            $\sigma_i(t+1) = \sigma_i(t)$
        else
            $\sigma_i(t+1)$ is selected uniformly at random from the channels in $\theta^{i*}(t+1)$.
        end if
    else
        Draw $\sigma_i(t+1)$ uniformly at random from $\mathcal{N}$.
    end if
    t = t + 1
end while

Fig. 2. Pseudocode of RLA



Theorem 2. When all players use RLA the regret with respect to the socially
optimal allocation is $O(n^{\frac{2M-1+2\gamma}{2M}})$ where $\gamma > 0$ can be arbitrarily small.



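Proof. Let $H(t)$ be the event that at time t there exists at least one user that
computed the socially optimal allocation incorrectly. Let $\omega$ be a sample path.
Then

\[ \begin{aligned}
\sum_{t=1}^{n} I(\omega \in H(t)) &\leq \sum_{t=1}^{n} \sum_{i=1}^{M} I(k^{i*}(t) \neq k^*) \\
&\leq \sum_{(t,i,j,l)=(1,1,1,1)}^{(n,M,N,M)} I\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon \right) \\
&= \sum_{(t,i,j,l)=(1,1,1,1)}^{(n,M,N,M)} I\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon,\ T^i_{j,l}(t) \geq \frac{a \ln t}{\epsilon^2} \right) \\
&\quad + \sum_{(t,i,j,l)=(1,1,1,1)}^{(n,M,N,M)} I\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon,\ T^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right).
\end{aligned} \tag{7} \]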



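Let $\epsilon^i_{j,l}(t) = \sqrt{\frac{a \ln t}{T^i_{j,l}(t)}}$. Then $T^i_{j,l}(t) \geq \frac{a \ln t}{\epsilon^2} \Rightarrow \epsilon \geq \sqrt{\frac{a \ln t}{T^i_{j,l}(t)}} = \epsilon^i_{j,l}(t)$.
Therefore,

\[ I\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon,\ T^i_{j,l}(t) \geq \frac{a \ln t}{\epsilon^2} \right) \leq I\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon^i_{j,l}(t) \right), \]

\[ I\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon,\ T^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right) \leq I\left( T^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right). \]

Then, continuing from (7),

\[ \sum_{t=1}^{n} I(\omega \in H(t)) \leq \sum_{(t,i,j,l)=(1,1,1,1)}^{(n,M,N,M)} \left( I\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon^i_{j,l}(t) \right) + I\left( T^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right) \right). \tag{8} \]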

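Taking the expectation over (8),

\[ E\left[ \sum_{t=1}^{n} I(\omega \in H(t)) \right] \leq \sum_{(t,i,j,l)=(1,1,1,1)}^{(n,M,N,M)} P\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon^i_{j,l}(t) \right) + \sum_{(t,i,j,l)=(1,1,1,1)}^{(n,M,N,M)} P\left( T^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right). \tag{9} \]

We have

\[ \begin{aligned}
P\left( |u^i_{j,l}(T^i_{j,l}(t)) - v_j(l)| \geq \epsilon^i_{j,l}(t) \right)
&= P\left( u^i_{j,l}(T^i_{j,l}(t)) - v_j(l) \geq \epsilon^i_{j,l}(t) \right) + P\left( u^i_{j,l}(T^i_{j,l}(t)) - v_j(l) \leq -\epsilon^i_{j,l}(t) \right) \\
&\leq 2 \exp\left( -\frac{2 (T^i_{j,l}(t))^2 (\epsilon^i_{j,l}(t))^2}{T^i_{j,l}(t)} \right) = 2 \exp\left( -\frac{2 T^i_{j,l}(t)\, a \ln t}{T^i_{j,l}(t)} \right) = \frac{2}{t^{2a}},
\end{aligned} \tag{10} \]

where (10) follows from the Chernoff-Hoeffding inequality.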



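Now we will bound $P\left( T^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right)$. Let $TR^i_{j,l}(t)$ be the number of time
steps in which player i played channel j and observed l users on channel j in the
time steps where all players randomized up to time t. Then

\[ \left\{ \omega : T^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right\} \subset \left\{ \omega : TR^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right\}. \tag{11} \]

Thus

\[ P\left( T^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right) \leq P\left( TR^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right). \tag{12} \]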

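Now we define new Bernoulli random variables $X^i_{j,l}(s)$ as follows: $X^i_{j,l}(s) = 1$
if all players randomize at time s and player i selects channel j and observes l
players on it according to the random draw; $X^i_{j,l}(s) = 0$ otherwise. Then $TR^i_{j,l}(t) =
\sum_{s=1}^{t} X^i_{j,l}(s)$ and $P(X^i_{j,l}(s) = 1) = \rho_s p_l$, where
$p_l = \binom{M-1}{l-1} \binom{M+N-l-2}{N-2} \Big/ \binom{M+N-1}{N-1}$ and $\rho_s = \frac{1}{s^{(1/2)-\gamma}}$.
Let $s_t = \sum_{s=1}^{t} \frac{1}{s^{(1/2)-\gamma}}$. Then

\[ \begin{aligned}
P\left( TR^i_{j,l}(t) < \frac{a \ln t}{\epsilon^2} \right)
&= P\left( \frac{TR^i_{j,l}(t)}{t} - \frac{p_l s_t}{t} < \frac{a \ln t}{t \epsilon^2} - \frac{p_l s_t}{t} \right) \\
&\leq P\left( \frac{TR^i_{j,l}(t)}{t} - \frac{p_l s_t}{t} < \frac{a \ln t}{t \epsilon^2} - \frac{p_l \left( (t+1)^{(1/2)+\gamma} - 1 \right)}{t ((1/2) + \gamma)} \right),
\end{aligned} \tag{13} \]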



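where (13) follows from Lemma 3. Let $\tau(M, N, \epsilon, \gamma, \gamma', a)$ be the time such that for
all $l \in \{1, 2, \ldots, M\}$,

\[ \frac{p_l \left( (t+1)^{(1/2)+\gamma} - 1 \right)}{t ((1/2) + \gamma)} - \frac{a \ln t}{t \epsilon^2} \geq t^{-(1/2)+\gamma'}, \tag{14} \]

where $0 < \gamma' < \gamma$. Then for all $t \geq \tau(M, N, \epsilon, \gamma, \gamma', a)$ (14) will hold since the
left-hand side decays more slowly than the right-hand side. Thus we have for $t \geq \tau(M, N, \epsilon, \gamma, \gamma', a)$

\[ \begin{aligned}
P\left( \frac{TR^i_{j,l}(t)}{t} - \frac{p_l s_t}{t} < \frac{a \ln t}{t \epsilon^2} - \frac{p_l \left( (t+1)^{(1/2)+\gamma} - 1 \right)}{t ((1/2) + \gamma)} \right)
&\leq P\left( \frac{TR^i_{j,l}(t)}{t} - \frac{p_l s_t}{t} < -t^{-(1/2)+\gamma'} \right) \\
&\leq e^{-2 t\, t^{2\gamma' - 1}} = e^{-2 t^{2\gamma'}} \leq e^{-2 \ln t} = \frac{1}{t^2}.
\end{aligned} \tag{15} \]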



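Let $a = 1$. Then continuing from (9) by substituting (10) and (15) we have

\[ E\left[ \sum_{t=1}^{n} I(\omega \in H(t)) \right] \leq M^2 N \left( \tau(M, N, \epsilon, \gamma, \gamma', 1) + 3 \sum_{t=1}^{n} \frac{1}{t^2} \right). \tag{16} \]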



Thus we proved that the expected number of time steps in which there exists
at least one user that computed the socially optimal allocation incorrectly is
finite. Note that because RLA explores with probability $1/t^{\frac{1}{2M} - \frac{\gamma}{M}}$, the expected
number of time steps up to time n in which at least one player randomizes
is

\[ \sum_{t=1}^{n} \left( 1 - \left( 1 - \frac{1}{t^{\frac{1}{2M} - \frac{\gamma}{M}}} \right)^{M} \right) \leq \sum_{t=1}^{n} \frac{M}{t^{\frac{1}{2M} - \frac{\gamma}{M}}} = O\left( n^{\frac{2M-1+2\gamma}{2M}} \right). \tag{17} \]

Note that players can choose $\gamma$ arbitrarily small, at the cost of increasing the finite regret due
to $\tau(M, N, \epsilon, \gamma, \gamma', 1)$. Thus if we are interested in the asymptotic performance
then $\gamma > 0$ can be arbitrarily small.

Now we do the worst case analysis. We classify the time steps into two types. Good
time steps are those in which all the players know the socially optimal allocation correctly
and none of the players randomize, excluding the randomizations done for settling
down to the socially optimal allocation. Bad time steps are those in which there exists a
player that does not know the socially optimal allocation correctly or there is
a player that randomizes, excluding the randomizations done for settling down
to the socially optimal allocation. The expected number of bad time steps in which there
exists a player that does not know the socially optimal allocation correctly is
finite, while the expected number of time steps in which there is a player that randomizes,
excluding the randomizations done for settling down to the socially optimal
allocation, is $O(n^{\frac{2M-1+2\gamma}{2M}})$. The worst case is when each bad step is followed by
a good step. Then from this good step the expected number of time steps to settle
down to the socially optimal allocation is
$\left( 1 - 1 \Big/ \binom{M+z^*-1}{z^*-1} \right) \Big/ \left( 1 \Big/ \binom{M+z^*-1}{z^*-1} \right)$, where
$z^*$ is the number of channels which have at least one user in the socially optimal
allocation. Assuming in the worst case that the sum of the utilities of the players is
0 when they are not playing the socially optimal allocation, we have

\[ R(n) \leq \frac{1 - 1 \Big/ \binom{M+z^*-1}{z^*-1}}{1 \Big/ \binom{M+z^*-1}{z^*-1}} \left( M^2 N \left( \tau(M, N, \epsilon, \gamma, \gamma', 1) + 3 \sum_{t=1}^{n} \frac{1}{t^2} \right) + O\left( n^{\frac{2M-1+2\gamma}{2M}} \right) \right). \]

□



As mentioned earlier, under a classical multi-armed bandit problem
approach as in [3,4,5,15,16,21,22], a logarithmic regret $O(\log n)$ is
achievable. The fundamental difference between these studies and the problem
in the present paper is the following: Assume that at time t user i selects channel
j. This means that i selects to observe an arm from the set $\{(j, k) : k \in \mathcal{M}\}$, but
the arm assigned to i is selected from this set depending on the choices of other
players.

Also note that in RLA a user computes the socially optimal allocation accord- 
ing to its estimates at each time step. This could pose significant computational 



effort since integer programming is NP-hard in general. However, by exploiting 
the stability condition on the socially optimal allocation a user may reduce the 
number of computations; this is a subject of future research. 

5 An Algorithm for Socially Optimal Allocation (Case 3) 

In this section we assume that $g_j(n)$ is decreasing in n for all $j \in \mathcal{N}$. For
simplicity we assume that the socially optimal allocation is unique up to the
permutations of $\sigma^*$. When this uniqueness assumption does not hold we need a
more complicated algorithm to achieve the socially optimal allocation. All users
use the Random Selection (RS) algorithm defined in Fig. 3. RS consists of two
phases. Phase 1 is the learning phase where the user randomizes to learn the
interference functions. Let $B_j(t)$ be the set of distinct payoffs observed from
channel j up to time t. Then the payoffs in set $B_j(t)$ can be ordered in a
decreasing way with the associated indices $\{1, 2, \ldots, |B_j(t)|\}$. Let $O(B_j(t))$ denote
this ordering. Since the IFs are decreasing, at the time $|B_j(t)| = M$, the user has
learned $g_j$. At the time $|\cup_{j=1}^{N} B_j(t)| = MN$, the user has learned all IFs. Then,
the user computes $A^*$ and phase 2 of RS starts where the user randomizes to
converge to the socially optimal allocation.
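
The learning step of phase 1 relies on a simple observation: with constant rates and a decreasing IF, the M distinct payoffs seen on a channel, sorted in decreasing order, are exactly the payoffs with 1, 2, ..., M users. The sketch below illustrates this bookkeeping for one channel. The function names, the tolerance `tol` used to compare floating point payoffs, and the example rate and IF are assumptions for illustration, and the sketch assumes the payoffs at different occupancy levels are strictly distinct.

```python
def record_payoff(observed, payoff, tol=1e-12):
    """Insert payoff into the decreasing list of distinct payoffs.

    Returns True if the payoff was new (this is when the counter b of
    Fig. 3 would be incremented), False if it had been seen before.
    """
    if any(abs(payoff - x) <= tol for x in observed):
        return False
    observed.append(payoff)
    observed.sort(reverse=True)
    return True


def learned_payoffs(observed, m):
    """Return {n: payoff with n users} once all m levels are known, else None."""
    if len(observed) < m:
        return None
    return {n: observed[n - 1] for n in range(1, m + 1)}


# Hypothetical channel: constant rate 0.8, g(n) = 1/n, M = 3 users.
# The user happens to see occupancies 2, 1, 2, 3 in that order.
observed = []
new_flags = [record_payoff(observed, 0.8 * (1.0 / n)) for n in (2, 1, 2, 3)]
table = learned_payoffs(observed, 3)
```

The user never needs to observe the occupancy itself: the rank of a payoff within the ordered set identifies the number of users that produced it.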



Random Selection (RS)
Initialize: $t = 1$, $b = 0$, $B_j(1) = \emptyset, \forall j \in \mathcal{N}$, sample $\sigma_i(1)$ from the uniform
    distribution on $\mathcal{N}$
Phase 1
while $b < MN$ do
    if $h_{\sigma_i(t)}(t) \notin B_{\sigma_i(t)}(t)$ then
        $B_{\sigma_i(t)}(t+1) \leftarrow O(B_{\sigma_i(t)}(t) \cup h_{\sigma_i(t)}(t))$
        b = b + 1
    end if
    Sample $\sigma_i(t+1)$ from the uniform distribution on $\mathcal{N}$
    t = t + 1
end while
find the socially optimal allocation $\sigma^*$
Phase 2
while $b \geq MN$ do
    if $h_{\sigma_i(t)}(t) < v^*_{\sigma_i(t)}$ then
        Sample $\sigma_i(t+1)$ from the uniform distribution on $\mathcal{N}$
    else
        $\sigma_i(t+1) = \sigma_i(t)$
    end if
    t = t + 1
end while

Fig. 3. Pseudocode of RS



Theorem 3. Under the assumptions of (C3) if all players use RS algorithm to 
choose their actions, then the expected time to converge to the socially optimal 
allocation is finite. 

Proof. Let $T_{OPT}$ denote the time the socially optimal allocation is achieved, $T_L$
be the time when all users learn all the IFs, and $T_F$ be the time it takes to reach the
socially optimal allocation after all users learn all the IFs. Then $T_{OPT} = T_L + T_F$
and $E[T_{OPT}] = E[T_L] + E[T_F]$. We will bound $E[T_L]$ and $E[T_F]$. Let $T_i$ be the
first time that i users have learned the IFs. Let $\tau_i = T_i - T_{i-1}$, $i = 1, 2, \ldots, M$,
and $T_0 = 0$. Then $T_L = \tau_1 + \ldots + \tau_M$. Define a Markov chain over all $N^M$
possible configurations of M users over N channels based on the randomization of
the algorithm. This Markov chain has a time dependent stochastic matrix which
changes at times $T_1, T_2, \ldots, T_M$. Let $P_{T_0}, P_{T_1}, \ldots, P_{T_M}$ denote the stochastic
matrices after the times $T_0, T_1, \ldots, T_M$ respectively. This Markov chain is irreducible
at all times up to $T_M$ and is reducible with absorbing states corresponding to
the socially optimal allocations after $T_M$. Let $T'_1, T'_2, \ldots, T'_M$ be the times that
all configurations are visited when the Markov chain has stochastic matrices
$P_{T_0}, P_{T_1}, \ldots, P_{T_{M-1}}$ respectively. Then because of irreducibility and finite states
$E[T'_i] < z_1$, $i = 1, \ldots, M$ for some constant $z_1 > 0$. Since $\tau_i \leq T'_i$, $i = 1, \ldots, M$
a.s. we have $E[T_L] < M z_1$. For the Markov chain with stochastic matrix $P_{T_M}$
all the configurations that do not correspond to the socially optimal allocation
are transient states. Since starting from any transient state the mean time to
absorption is finite, $E[T_F] < z_2$ for some constant $z_2 > 0$. □

6 Conclusion 

In this paper we studied the decentralized multiuser resource allocation problem 
with various levels of communication and cooperation between the users. Under 
three different scenarios we proposed three algorithms with reasonable
performance. Our future research will include characterization of achievable
performance regions for these scenarios. For example, in case 2 we are interested in
finding an optimal algorithm and a lower bound on the performance.